Smart Paper Notes

Sketch-and-solve approaches to k-means clustering by semidefinite programming

Charles Clum , Dustin G. Mixon , Soledad Villar , Kaiying Xie

Categories: Machine Learning | Machine Learning (Statistics)

2022-11-28

We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering. When the data is appropriately separated we identify the k-means optimal clustering. Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value. This lower bound is data-driven; it does not make any assumption on the data nor how it is generated. We provide code and an extensive set of numerical experiments where we use this approach to certify approximate optimality of clustering solutions obtained by k-means++.

Translated by Google Translate

Graph Neural Networks for Community Detection on Sparse Graphs

Luana Ruiz , Ningyuan Huang , Soledad Villar

Categories: Machine Learning

2022-11-06

Spectral methods provide consistent estimators for community detection in dense graphs. However, their performance deteriorates as the graphs become sparser. In this work we consider a random graph model that can produce graphs at different levels of sparsity, and we show that graph neural networks can outperform spectral methods on sparse graphs. We illustrate the results with numerical examples in both synthetic and real graphs.


Equivariant maps from invariant functions

Ben Blum-Smith , Soledad Villar

Categories: Machine Learning (Statistics) | Machine Learning

2022-09-29

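In equivariant machine learning, the idea is to restrict learning to a hypothesis class in which all functions are equivariant with respect to some group action. Irreducible representations or invariant theory are typically used to parameterize the space of such functions. In this note, we explain a general procedure, attributed to Malgrange, to express all polynomial maps between linear spaces that are equivariant with respect to the action of a group $G$, given a characterization of the invariant polynomials on a larger space. The method also parameterizes smooth equivariant maps in the case where $G$ is a compact Lie group.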


From Local to Global: Spectral-Inspired Graph Neural Networks

Ningyuan Huang , Soledad Villar , Carey E. Priebe , Da Zheng , Chengyue Huang , Lin Yang , Vladimir Braverman

Categories: Machine Learning (Statistics) | Machine Learning

2022-09-24

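Graph neural networks (GNNs) are powerful deep learning methods for non-Euclidean data. Popular GNNs are message-passing algorithms (MPNNs) that aggregate and combine signals in a local graph neighborhood. However, shallow MPNNs tend to miss long-range signals and perform poorly on some heterophilous graphs, while deep MPNNs can suffer from issues such as over-smoothing or over-squashing. To mitigate such issues, existing works typically borrow normalization techniques from training neural networks on Euclidean data, or modify the graph structure. Yet these approaches are not well understood theoretically and may increase the overall computational complexity. In this work, we draw inspiration from spectral graph embedding and propose $\texttt{PowerEmbed}$, a simple layer-wise normalization technique to boost MPNNs. We show that $\texttt{PowerEmbed}$ can provably express the top-$k$ leading eigenvectors of the graph operator, which prevents over-smoothing and is agnostic to the graph topology; at the same time, it produces a list of representations ranging from local features to global signals, which avoids over-squashing. We apply $\texttt{PowerEmbed}$ to a wide range of simulated and real graphs and demonstrate its competitive performance, particularly on heterophilous graphs.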
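The spectral idea in the abstract above (repeated propagation by a graph operator, with normalization after each layer, approaching the top-$k$ eigenvectors) can be illustrated with classical orthogonal (subspace) iteration. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `power_embed_sketch`, the use of a QR factorization as the normalization step, and the requirement that the operator `S` be symmetric are all choices made for this example.

```python
import numpy as np

def power_embed_sketch(S, X, num_layers):
    """Hypothetical illustration: propagate features X by the operator S and
    re-orthonormalize after every layer (orthogonal iteration).

    S: (n, n) symmetric graph operator, e.g. a normalized adjacency matrix.
    X: (n, k) node features with linearly independent columns.
    Returns the list [X, U_1, ..., U_T], from local to increasingly global.
    """
    U = np.asarray(X, dtype=float)
    representations = [U]
    for _ in range(num_layers):
        # One propagation step, then orthonormalize the columns with QR.
        U, _ = np.linalg.qr(S @ U)
        representations.append(U)
    return representations
```

Keeping every intermediate `U_t` mirrors the "list of representations" mentioned in the abstract: early entries stay close to the input features, while later entries approach the leading eigenspace of `S` when its top-$k$ eigenvalues are separated in magnitude from the rest.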


MarkerMap: nonlinear marker selection for single-cell studies

Nabeel Sarwar , Wilson Gregory , George A Kevrekidis , Soledad Villar , Bianca Dumitrascu

Categories: Machine Learning (Statistics) | Machine Learning

2022-07-28

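Single-cell RNA-seq data allow the quantification of cell type differences across a growing set of biological contexts. However, pinpointing a small subset of genomic features that explains this variability can be ill-defined and computationally intractable. Here we introduce MarkerMap, a generative model for selecting minimal gene sets that are maximally informative of cell type origin and enable whole-transcriptome reconstruction. MarkerMap provides a scalable framework for both supervised marker selection, aimed at identifying specific cell type populations, and unsupervised marker selection, aimed at gene expression imputation and reconstruction. We benchmark MarkerMap's competitive performance against previously published approaches on real single-cell gene expression data sets. MarkerMap is available as a pip-installable package and as a community resource aimed at developing explainable machine learning techniques for enhancing interpretability in single-cell studies.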


Dimensionless machine learning: Imposing exact units equivariance

Soledad Villar , Weichi Yao , David W. Hogg , Ben Blum-Smith , Bianca Dumitrascu

Categories: Machine Learning (Statistics) | Machine Learning

2022-04-02

Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (non-compact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly units-equivariant machine learning: For any given learning task, we first construct a dimensionless version of its inputs using classic results from dimensional analysis, and then perform inference in the dimensionless space. Our approach can be used to impose units equivariance across a broad range of machine learning methods which are equivariant to rotations and other groups. We discuss the in-sample and out-of-sample prediction accuracy gains one can obtain in contexts like symbolic regression and emulation, where symmetry is important. We illustrate our approach with simple numerical examples involving dynamical systems in physics and ecology.
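The first step described above, constructing dimensionless inputs with classic dimensional analysis, can be sketched as a null-space computation in the spirit of the Buckingham Pi theorem. This is a minimal sketch under stated assumptions rather than the authors' code: the function names, the pendulum variables, and the use of an SVD to find the null space are all choices made for this example, and inputs are assumed to be strictly positive.

```python
import numpy as np

def dimensionless_exponents(D, tol=1e-10):
    """Columns of the result span the null space of the dimension matrix D
    (rows: base units, columns: variables). Each column is a vector of
    exponents that defines one dimensionless product of the variables."""
    D = np.asarray(D, dtype=float)
    _, s, vt = np.linalg.svd(D)
    rank = int(np.sum(s > tol))
    return vt[rank:].T

def dimensionless_features(x, exponents):
    """Compute prod_i x_i ** e_i for every exponent column (requires x > 0)."""
    x = np.asarray(x, dtype=float)
    return np.exp(np.log(x) @ exponents)

# Hypothetical example: a pendulum with period T, length l, gravity g, mass m.
# Rows are the base units (mass, length, time); columns are (T, l, g, m).
D = np.array([[0, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 0, -2, 0]])
E = dimensionless_exponents(D)   # one group, proportional to T^2 * g / l

x_si = np.array([2.0, 1.0, 9.81, 0.5])                     # s, m, m/s^2, kg
x_other = np.array([2.0e3, 1.0e2, 9.81 * 1e2 / 1e6, 0.5])  # ms, cm, cm/ms^2, kg
pi_si = dimensionless_features(x_si, E)
pi_other = dimensionless_features(x_other, E)              # same value as pi_si
```

Any model trained on `dimensionless_features(x, E)` instead of `x` sees the same input regardless of the unit system, which is the sense in which the dimensionless representation enforces the symmetry exactly.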


Computing the Performance of A New Adaptive Sampling Algorithm Based on The Gittins Index in Experiments with Exponential Rewards

James K. He , Sofía S. Villar , Lida Mavrogonatou

Categories: Machine Learning

2023-01-03

Designing experiments often requires balancing between learning about the true treatment effects and earning from allocating more samples to the superior treatment. While optimal algorithms for the Multi-Armed Bandit Problem (MABP) provide allocation policies that optimally balance learning and earning, they tend to be computationally expensive. The Gittins Index (GI) is a solution to the MABP that can simultaneously attain optimality and computational efficiency goals, and it has recently been used in experiments with Bernoulli and Gaussian rewards. For the first time, we present a modification of the GI rule that can be used in experiments with exponentially distributed rewards. We report its performance in simulated 2-armed and 3-armed experiments. Compared to traditional non-adaptive designs, our novel GI-modified design shows operating characteristics comparable in learning (e.g. statistical power) but substantially better in earning (e.g. direct benefits). This illustrates the potential that designs using a GI approach to allocate participants have to improve participant benefits, increase efficiencies, and reduce experimental costs in adaptive multi-armed experiments with exponential rewards.


Forecasting through deep learning and modal decomposition in multi-phase concentric jets

León Mata , Rodrigo Abadía-Heredia , Manuel Lopez-Martin , José M. Pérez , Soledad Le Clainche

Categories: Machine Learning

2022-12-24

This work presents a set of neural network (NN) models specifically designed for accurate and efficient fluid dynamics forecasting. We show how neural network training can be improved by reducing data complexity through a modal decomposition technique called higher order dynamic mode decomposition (HODMD), which identifies the main structures inside flow dynamics and reconstructs the original flow using only these main structures. This reconstruction has the same number of samples and spatial dimension as the original flow, but with less complex dynamics while preserving its main features. We also show the low computational cost required by the proposed NN models, both in their training and inference phases. The core idea of this work is to test the limits of applicability of deep learning models to data forecasting in complex fluid dynamics problems. Generalization capabilities of the models are demonstrated by using the same neural network architectures to forecast the future dynamics of four different multi-phase flows. Data sets used to train and test these deep learning models come from Direct Numerical Simulations (DNS) of these flows.


Monte Carlo Techniques for Addressing Large Errors and Missing Data in Simulation-based Inference

Bingjie Wang , Joel Leja , Ashley Villar , Joshua S. Speagle

Categories: Machine Learning

2022-11-07

Upcoming astronomical surveys will observe billions of galaxies across cosmic time, providing a unique opportunity to map the many pathways of galaxy assembly to an incredibly high resolution. However, the huge amount of data also poses an immediate computational challenge: current tools for inferring parameters from the light of galaxies take $\gtrsim 10$ hours per fit. This is prohibitively expensive. Simulation-based Inference (SBI) is a promising solution. However, it requires simulated data with identical characteristics to the observed data, whereas real astronomical surveys are often highly heterogeneous, with missing observations and variable uncertainties determined by sky and telescope conditions. Here we present a Monte Carlo technique for treating out-of-distribution measurement errors and missing data using standard SBI tools. We show that out-of-distribution measurement errors can be approximated by using standard SBI evaluations, and that missing data can be marginalized over using SBI evaluations over nearby data realizations in the training set. While these techniques slow the inference process from $\sim 1$ sec to $\sim 1.5$ min per object, this is still significantly faster than standard approaches while also dramatically expanding the applicability of SBI. This expanded regime has broad implications for future applications to astronomical surveys.


Some performance considerations when using multi-armed bandit algorithms in the presence of missing data

Xijin Chen , Kim May Lee , Sofia S. Villar , David S. Robertson

Categories: Machine Learning (Statistics) | Machine Learning

2022-05-08

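When comparing the performance of multi-armed bandit algorithms, the potential impact of missing data is often overlooked. In practice, missing data also affects their implementation, where the simplest way to overcome the problem is to continue sampling according to the original bandit algorithm while ignoring missing outcomes. Through an extensive simulation study, we investigate how this approach to handling missing data affects the performance of several bandit algorithms, assuming the rewards are missing at random. We focus on two-armed bandit algorithms with binary outcomes in the context of patient allocation for clinical trials with relatively small sample sizes. However, our results apply to other applications of bandit algorithms where missing data is expected to occur. We assess the resulting operating characteristics, including the expected reward, and consider different probabilities of missingness in both arms. The key finding of our work is that, when using the simplest strategy of ignoring missing data, the impact on the expected performance of multi-armed bandit strategies varies according to how these strategies balance the exploration-exploitation trade-off. Algorithms geared towards exploration continue to assign samples to the arm with more missing responses (which, being perceived as the arm with less observed information, is deemed more appealing by the algorithm than it otherwise would be). In contrast, algorithms geared towards exploitation rapidly assign a high value to samples from the arm with the currently high mean, irrespective of the number of observations per arm. Furthermore, for algorithms that focus more on exploration, we illustrate that the problem of missing responses can be alleviated using a simple mean imputation approach.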
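The two strategies contrasted above, ignoring missing outcomes versus mean imputation, can be sketched in a small simulation. This is a minimal sketch under stated assumptions, not the authors' simulation code: the abstract does not name the algorithms it studies, so a UCB1-style index stands in for an exploration-oriented rule, and the function name `run_ucb` and all of its parameters are hypothetical.

```python
import numpy as np

def run_ucb(p_success, p_missing, horizon, impute=False, seed=0):
    """Two-armed (or k-armed) Bernoulli bandit with outcomes missing at random.

    p_success[a]: success probability of arm a.
    p_missing[a]: probability that an outcome from arm a is not observed.
    impute=False: missing outcomes are ignored.
    impute=True:  a missing outcome is replaced by the arm's current mean.
    Returns the number of allocations made to each arm.
    """
    rng = np.random.default_rng(seed)
    k = len(p_success)
    pulls = np.zeros(k, dtype=int)   # allocations per arm
    n_obs = np.zeros(k)              # outcomes counted by the algorithm
    total = np.zeros(k)              # sum of observed or imputed outcomes
    for t in range(horizon):
        if t < k:
            arm = t                  # allocate once to each arm first
        else:
            safe_n = np.maximum(n_obs, 1)
            index = total / safe_n + np.sqrt(2 * np.log(t + 1) / safe_n)
            # Arms with no counted outcome yet are always tried next.
            index = np.where(n_obs > 0, index, np.inf)
            arm = int(np.argmax(index))
        pulls[arm] += 1
        reward = float(rng.random() < p_success[arm])
        missing = rng.random() < p_missing[arm]
        if not missing:
            n_obs[arm] += 1
            total[arm] += reward
        elif impute and n_obs[arm] > 0:
            # Mean imputation: the estimate is unchanged but the count grows,
            # so the exploration bonus for this arm shrinks as if observed.
            total[arm] += total[arm] / n_obs[arm]
            n_obs[arm] += 1
    return pulls

ignored = run_ucb([0.3, 0.7], [0.5, 0.1], horizon=200, impute=False)
imputed = run_ucb([0.3, 0.7], [0.5, 0.1], horizon=200, impute=True)
```

Comparing `ignored` and `imputed` across many seeds gives a rough view of how much an exploration bonus keeps drawing allocations towards the arm with more missing responses, and how imputation changes that.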
